Contents

  1. Stratified vs Random Sampling
  2. Out of memory sampling
    • Spark
    • AWS Server
    • FF
    • random.sample()
    • SQL database
    • PyTables random sampling

Two Important Methods of Random Sampling

When most people think of random sampling, they think of Simple Random Sampling. If our dataset holds 1000 observations and we want only 100, simple random sampling picks 100 of the 1000 (typically without replacement), giving every observation the same probability of being selected.
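As a minimal sketch of the idea, the snippet below draws 100 observations from 1000 with Python's standard `random` module. The `population` list of integer IDs and the seed value are assumptions made purely for illustration.

```python
import random

# Hypothetical population: 1000 observation IDs (illustrative only).
population = list(range(1000))

# A seeded generator is used only so the draw can be reproduced.
rng = random.Random(42)

# sample() draws without replacement, and every observation
# has the same chance of being selected.
sample = rng.sample(population, k=100)

print(len(sample), len(set(sample)))  # 100 100 (no duplicates)
```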

Stratified Sampling

Why we would use Stratified Sampling

The main benefit of stratified sampling is that it makes the sample more likely to stay representative of the population, because every subpopulation (stratum) is sampled separately rather than left to chance. This property is especially valuable when collecting additional observations is expensive.

Specifying Subpopulations

When specifying subpopulations, the following conditions apply:

  1. The subpopulations must be mutually exclusive: no element may belong to more than one stratum.
  2. The strata must be collectively exhaustive: together they must include every element of the population.

With some thought, we can split the population to take account of both age and arthritis status while satisfying both the conditions above:
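One way such a split could look in code is sketched below. The age cut-off of 50, the label names, and the tiny `people` list are all hypothetical choices for illustration. Because the function returns exactly one label for any person, the resulting strata are mutually exclusive and cover everyone.

```python
from collections import Counter

def assign_stratum(age, has_arthritis):
    """Return exactly one stratum label per person.

    The age threshold of 50 is an illustrative assumption.
    """
    age_band = "under_50" if age < 50 else "50_plus"
    status = "arthritis" if has_arthritis else "no_arthritis"
    return f"{age_band}/{status}"

# Hypothetical (age, has_arthritis) records.
people = [(34, False), (61, True), (47, True), (72, False), (55, True)]

labels = [assign_stratum(age, arthritis) for age, arthritis in people]
stratum_counts = Counter(labels)

# Every person received one label, so the stratum sizes add up
# to the population size.
assert sum(stratum_counts.values()) == len(people)
```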

Once we have specified our strata, we need to decide how many observations to draw from each stratum. This requires knowing, or estimating reasonably well, the proportion of the population that falls into each stratum. As this will usually require parsing the entire dataset,
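To make the sizing step concrete, here is a hedged sketch of proportional allocation: each stratum receives a share of the sample equal to its share of the population. The function name `proportional_allocation`, the stratum labels, and the counts are assumptions for illustration; the largest-remainder rounding is one reasonable choice among several, used here so the allocations always add up to the requested sample size.

```python
def proportional_allocation(stratum_counts, sample_size):
    """Split sample_size across strata in proportion to their sizes."""
    total = sum(stratum_counts.values())
    # Exact (possibly fractional) share for each stratum.
    exact = {s: sample_size * n / total for s, n in stratum_counts.items()}
    # Start from the whole-number part of each share.
    alloc = {s: int(x) for s, x in exact.items()}
    leftover = sample_size - sum(alloc.values())
    # Give the remaining units to the largest fractional parts.
    by_fraction = sorted(exact, key=lambda s: exact[s] - alloc[s], reverse=True)
    for s in by_fraction[:leftover]:
        alloc[s] += 1
    return alloc

# Hypothetical stratum sizes for a population of 1000.
counts = {
    "under_50/no_arthritis": 450,
    "under_50/arthritis": 50,
    "50_plus/no_arthritis": 300,
    "50_plus/arthritis": 200,
}
allocation = proportional_allocation(counts, sample_size=100)
```

With these counts the strata make up 45%, 5%, 30% and 20% of the population, so a sample of 100 is split 45, 5, 30 and 20.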

Stratified Sampling In Python
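
A minimal sketch of stratified sampling using only the standard library follows. It groups records by stratum and then takes a simple random sample of the same fraction from each group. The helper name `stratified_sample`, the record layout, and the 10% fraction are assumptions for illustration, not a reference implementation.

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_of, fraction, seed=None):
    """Draw the same fraction from every stratum, without replacement."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)

    # Group the records by their stratum label.
    groups = defaultdict(list)
    for record in records:
        groups[stratum_of(record)].append(record)

    sample = []
    for label in sorted(groups):  # sorted for a reproducible order
        members = groups[label]
        # Take at least one record so no stratum disappears entirely.
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical data: 1000 (id, status) records, 10% with arthritis.
records = [(i, "arthritis" if i < 100 else "no_arthritis") for i in range(1000)]
picked = stratified_sample(records, stratum_of=lambda r: r[1], fraction=0.1, seed=0)
```

Here the sample keeps the population's 10/90 split exactly: 10 records with arthritis and 90 without. A simple random sample of the same size would only match that split on average.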

Simple Random Sampling


References